
Add notebook for "Evaluating AI Search Engines with the judges Library" #270

Open · wants to merge 8 commits into main

Conversation

freddiev4

Description

This notebook showcases how to use judges—an open-source library for LLM-as-a-Judge evaluators—to assess and compare outputs from AI search engines like Gemini, Perplexity, and EXA.

This PR is a continuation of #257 -- shepherding the PR across!

What is judges?
judges is an open-source library that provides research-backed, ready-to-use LLM-based evaluators for assessing outputs across various dimensions such as correctness, quality, and harmfulness. It supports both:

  1. Classifiers (binary evaluations like True/False).
  2. Graders (scored evaluations on numerical scales).

The library also provides an integration with litellm, allowing access to most open- and closed-source models and providers.
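
For orientation, here is a minimal sketch of the classifier pattern described above, following the usage shown in the judges README; the exact import path, default model name, and `judge()` signature are assumptions to verify against the installed version and the notebook itself:

```python
# Minimal sketch of a binary (classifier) judge; import path and signature follow the
# judges README at the time of writing and may differ in newer releases.
from judges.classifiers.correctness import PollMultihopCorrectness

question = "What is the capital of France?"
answer = "Paris is the capital of France."
expected = "Paris"

# Classifiers return a True/False verdict plus the evaluator model's reasoning.
correctness = PollMultihopCorrectness(model="gpt-4o-mini")  # example model string
judgment = correctness.judge(input=question, output=answer, expected=expected)

print(judgment.score)      # True or False
print(judgment.reasoning)  # free-text explanation from the evaluator
```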

What This Notebook Does

  • Demonstrates how to use judges with litellm to evaluate AI search engine responses.
  • Uses Llama 3.3 (together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo) as the LLM evaluator to assess:
    • Correctness (factual accuracy).
    • Quality (clarity, helpfulness).
  • Provides a step-by-step workflow to evaluate outputs generated by search engines.
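
To make the litellm routing concrete, below is a hedged sketch of pointing a judge at the open model named above. The grader class name comes from the review discussion later in this thread; its import path and the exact constructor and `judge()` arguments are assumptions to check against the notebook:

```python
# Sketch only: route the evaluator through litellm to an open model hosted on Together AI.
# MTBenchChatBotResponseQuality is named in the review discussion; the import path is assumed.
from judges.graders.response_quality import MTBenchChatBotResponseQuality

# A response collected from one of the AI search engines being compared (placeholder text).
search_engine_response = (
    "Green tea contains catechins and L-theanine, which are associated with "
    "antioxidant effects and improved focus."
)

# Any litellm-supported "provider/model" string should be usable as the evaluator model.
quality = MTBenchChatBotResponseQuality(
    model="together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo"
)
judgment = quality.judge(
    input="What are the health benefits of green tea?",
    output=search_engine_response,
)

print(judgment.score)      # numeric score on the grader's scale
print(judgment.reasoning)
```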

Open-Source Tools & Resources

Why This Notebook?

This notebook provides a practical example of using judges with an open-source model (Llama 3.3) to evaluate real-world AI outputs. It highlights the library's flexibility, ease of integration with litellm, and usefulness for benchmarking AI systems in a transparent, reproducible manner.

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks. (Powered by ReviewNB)

@merveenoyan (Collaborator)

I wish you hadn't made it into a new PR, it's harder to track our comments and following changes. Can you re-open your former PR and commit the files here to there instead so we see changes clearly?

@freddiev4 (Author) commented Jan 12, 2025

> I wish you hadn't made it into a new PR, it's harder to track our comments and following changes. Can you re-open your former PR and commit the files here to there instead so we see changes clearly?

👋 @merveenoyan sorry about that! All of the commits from that PR are the same in this one except the most recent one. James won’t be able to finish up that PR for us so I needed to make a new one to ensure it gets the attention it needs — please let me know how else I can help make this smoother.

I’m happy to copy over the comments from the previous PR as well if that helps!

Otherwise, I think the only other option would be to open a PR -on top- of the other one, but you would need to merge as a repo owner since the PR was made by James and not me.

@@ -0,0 +1,1680 @@
{
@merveenoyan (Collaborator) Jan 13, 2025

you can use notebook_login instead
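
For reference, the suggested helper is the standard Hugging Face Hub login widget; a minimal sketch, assuming the notebook currently pastes a token into the code:

```python
from huggingface_hub import notebook_login

# Opens an interactive login widget in the notebook instead of hard-coding an access token.
notebook_login()
```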



@@ -0,0 +1,1680 @@
{

@merveenoyan (Collaborator) Jan 13, 2025

you could explain the error or ask users to ignore imo, otherwise it's confusing

@merveenoyan (Collaborator) left a comment

I just left some nits, otherwise looks good! @stevhliu should review too

@@ -0,0 +1,1680 @@
{

@stevhliu (Member) Jan 13, 2025

"...research-backed evaluator prompts..."

@@ -0,0 +1,1680 @@
{

@stevhliu (Member) Jan 13, 2025

It may be easier to consume this content in table form:

| Judge | What | Why | Source | When to use |
|---|---|---|---|---|
| PollMultihopCorrectness | | | | |
| PrometheusAbsoluteCoarseCorrectness | | | | |
| MTBenchChatBotResponseQuality | | | | |

@stevhliu (Member) left a comment

Thanks, just a few more comments and then we can merge! 🤗

@@ -12,6 +12,7 @@ Check out the recently added notebooks:
- [Fine-tuning SmolVLM using direct preference optimization (DPO) with TRL on a consumer GPU](fine_tuning_vlm_dpo_smolvlm_instruct)
- [Smol Multimodal RAG: Building with ColSmolVLM and SmolVLM on Colab's Free-Tier GPU](multimodal_rag_using_document_retrieval_and_smol_vlm)
- [Fine-tuning SmolVLM with TRL on a consumer GPU](fine_tuning_smol_vlm_sft_trl)
- [Evaluating AI Search Engines with `judges` - the open-source library for LLM-as-a-judge evaluators](llm_judge_evaluating_ai_search_engines_with_judges_library)
A Member commented:

I'd put this notebook at the top of the list since it's the most recent, and then remove "Fine-tuning SmolVLM with TRL on a consumer GPU" to keep the list tidy
